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This paper combines coneepts from phylogeneties with Gaussian process regression, in 
order to allow evolutionary inference for function- valued traits that are correlated through 
phylogeny. By function- valued traits, we mean data objects which are indexed by one or more 
'spatial' co-ordinates, such as organism age or buffer pH level. Examples of function- valued 
I traits include organism mass with age or htness vs pH curves. We provide a nonparametric 

Bayesian model for such data by using Gaussian processes. The model may be used to infer 
ancestral function- valued traits, compare rates of evolution across a phylogeny, or to identify 
the most likely phylogenies consistent with observed data. It contrasts with methods which 
reduce data to summary statistics or a multivariate vector (without spatial co-ordinates) 
CO . and allows us to make inferential statements about ancestral traits themselves. We illustrate 

the use of the method on real data we generated by experiment. 
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O*; I. INTRODUCTION 

.2 

■ In this paper we consider the problem of evolutionary inference for function-valued traits. By 
evolutionary inference we mean statistical inference for data objects related by a phylogenetic tree 
(more specifically, a chronogram), and we use the term function-valued trait to mean a data object 
with a spatial co-ordinate [1]. Since the phylogenetic tree will play the role of time, we refer to 

^ . the co-ordinate of the trait as space. Example traits are curves for ambient temperature versus 

OO I growth rate for caterpillars, heart rhythm time series [2], or spectrograms of audio data. Figure lA 

gives an example of function-valued traits, related by a phylogenetic tree, which we generated by 

■ experiment. As an example application, suppose that we have spectrograms of the calls of several 
t;-^ , bat species. Using the methods described below, we may estimate the phylogeny of the call (which 

\ may differ from inferred genetic phylogenies); alternatively, given a known phylogeny, we may dare 

to make predictions about the sound of an ancestral bat call, or test for equal rates of evolution 
along the phylogeny. 

In order to perform evolutionary inference on function-valued traits, one might attempt to 
^ ' summarize each trait by a symbolic sequence (e.g. the presence or absence of certain characters) 

^ . [3] or by multivariate vectors of continuous characters or summary statistics [4, 5], and then employ 

existing methods for phylogenetic inference. However there are many possible ways to choose such 
a summary, which may not be mutually compatible at the level of statistical models; also, as yet 
unobserved traits may not possess a particular landmark used to generate the summary. In contrast, 
function-valued traits are functions of a spatial co-ordinate and this implies an ordering on their 
values which may be used to capture important patterns of variation in the data [6]. Moreover, 
since the map from trait to summary is many-to-one, the use of summary statistics only allows 
indirect inference for ancestral traits: the inverse problem of inferring the ancestral trait from the 
ancestral summary would still remain. In the following we shall adopt a nonparametric functional 
approach, which does not in principle require either summarizing or constraining function-valued 
data, and allows inference to be made about functions. 

Because of their generality, flexibility and mathematical simplicity, there has been much recent 
interest in the use of Gaussian process priors for Bayesian nonparametric regression. The mono- 
graph [7] gives a comprehensive treatment of their use in univariate regression, and both spatial 
and spatial-temporal Gaussian process models have been considered in the area of geostatistics 



2 



[8, 9]. In this paper we extend spatial-temporal Gaussian process modelling to take account of 
a tree topology, in order to model correlation due to shared evolutionary history. The method 
provides an explicit posterior distribution for ancestral traits, given observations from traits at any 
set of positions along the phylogeny. It can also be used to make inferences about the phylogeny 
or the evolutionary process operating along it. 

Our work is related to that of Felsenstein [4], which gives a method for comparative studies 
of real-valued traits corrected for phylogeny. It has become widely used, together with its exten- 
sions [5, 10], to infer correlations in the evolution of quantitative characters. Our model may be 
regarded as a Bayesian view of this method, extended in two directions. Firstly, each position 
on the phylogeny carries a function- valued (rather than real- or vector- valued) trait. Secondly, in 
[4] individual traits may be interpreted as observations of a Brownian motion indexed by a phy- 
logeny: this is generalised to a space-time Gaussian process indexed by a phylogeny. By extending 
Felsenstein's model to function-valued traits, we hope to partially address his observation that "the 
difficulty [in the multivariate case] is that quantitative characters will evolve at different rates, and 
in a correlated fashion" [4]. 

II. PHYLOGENETIC GAUSSIAN PROCESSES 

A. Inference with Gaussian Processes 

A Gaussian process is a collection of random variables, any finite number of which have a 
joint Gaussian distribution [7]. As a palette of statistical models, the set of Gaussian processes 
is large and flexible. Two examples used in the comparative method are the familiar Brownian 
Motion (Weiner process) and the Ornstein-Uhlenbeck process [11]. They have found application 
in this context as suitable models for continuous characters diffusing through evolutionary time. 
Despite this dynamical description, however, samples from any Gaussian process also have a very 
simple statistical description, as follows. Suppose that a Gaussian process / is observed at a vector 
of co-ordinates L. Then the resulting vector of sample values f{L) can be viewed as just one 
sample from a multivariate Gaussian distribution of dimension equal to |L|, the number of points 
of measurement: f{L) ~ J\f{0,a{L, L)). Here, a{L,L) is the matrix of the covariances between 
all pairs of observation co-ordinates in L (where li G L), and M represents the Gaussian 

distribution, its two arguments being the mean and covariance matrix. Throughout this paper and 
the supplementary material, and solely for ease of exposition, we make the common assumption 
that the distribution mean is zero. The only choice we have when specifying a Gaussian process is 
thus how the sample values covary, which is encoded by the covariance function a. Since the only 
restriction is that a must be a positive semidefinite and symmetric function, Gaussian processes 
provide flexible models for a wide class of random processes. In practice, one typically assumes 
a particular form for a, which may depend on hyperparameters 9. These hyperparameters often 
have interpretations as characteristic length scales, strengths of interaction between coordinate 
directions, and/or error standard deviations. The log-likelihood of the sample f{L) is then 

logp{f{L)\e) = f{Lfa{L,L,e)fiL) - log(det(a(L, L, 0))) - ^ log27r, (1) 

where we have made the dependence of cr on explicit. 

Bayesian prediction is analytically tractable when assuming a Gaussian process prior distribu- 
tion for a random function /. We might be interested in making inferences about the unobserved 
values of our random function / at a vector M of co-ordinates, given samples at the co-ordinates 
L. The posterior distribution of the vector /(M) given f{L) is also Gaussian and of the form: 

f{M)\f{L)^M{A,B) (2) 
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where 



A = a{M,L)a{L,L)-'f{L), 

B = a{M) - a{M,L)a{L,L)-^a{M,Lf 



(3) 
(4) 



and a(M,L) denotes the \M\ x \L\ matrix of the covariance function a evaluated at all pairs 
rrii G M,lj G L. Prom Eq. (3), the posterior mean vector A consists of linear combinations of 
the observations. Note that the posterior covariance matrix B, given by (4), is independent of 
the observations. Gaussian process regression is nonparametric in the sense that no assumption is 
made about the structure of the model: the more data gathered, the longer the vector f(L), and 
the more intricate the posterior model for f{M). This is the sense in which the data is not being 
summarized in order to perform inference. 

Though the Wiener process is one-dimensional, Gaussian processes can have arbitrary dimen- 
sion. In the context of our model, this is important because we can consider each point of observa- 
tion £ as corresponding to a point (x, t) in both space and evolutionary time. Here we define space 
as the co-ordinate of a function- valued-trait, as above. As an example, if the spatial coordinate 
is one-dimensional then f{L) would represent the values of a random two-dimensional function at 
various space-time co-ordinates. A slice through this function at a fixed value of t can be viewed 
as a one-dimensional function-valued trait at the fixed time t in its evolution (with straightfor- 
ward generalizations to higher-dimensional traits) . In the next section we will extend this view to 
processes on phylogenies. 



We now provide some formal details for the construction of a Gaussian process model for 
function- valued traits indexed by a phylogeny, T. Readers preferring examples early on might 
skim the following and move to the next subsection and then the illustrative applications. 

We suppose each observation i corresponds to a point (x,t) in the space-phylogeny S xT: that 
is, X e S is the value under consideration of the spatial (trait) coordinate, and t G T is the point 
under consideration on the phylogeny (t is not just a time co-ordinate but also indicates a branch 
of the phylogeny). Our aim in the following is to build a model for the evolution of a function- 
valued trait along this branched phylogeny. We will do this by constructing a covariance function 
^T{^i,£j) when each Zj is a point in 5 x T. There are two assumptions from which our covariance 
function will follow (it should be noted that these assumptions are consistent with [4]): 

Assumption II. 1. Conditional on their common ancestors in the phylogenetic tree T, any two 
traits are statistically independent. 

Assumption II. 2. The statistical relationship between a trait and any of its descendants in T is 
independent of the topology o/T. 

Assumption II.2 means that our statistical model of the evolutionary process is identical along 

paths through T from the root to any tip, and we call this the marginal process. It should be noted 
that Assumption II. 2 is made for ease of exposition in this paper and that it may be relaxed, for 
example to model unequal rates of evolution along different branches of T. We have assumed that 
T is a chronogram, and we will denote the date of a point t G T by the plain typeface symbol t. It 
is easy to see that the marginal process is Gaussian, and so it is sufficient to specify its covariance 
function S on 5* x T, where T is the set of all dates in T. 

We call the covariance function St, resulting from Assumptions II.1-II.2, the phylogenetic co- 
variance function. It is shown in the supplement that a large class of phylogenetic covariance 



B. Phylogenetic covariance function 
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functions may be constructed, offering a wide choice of priors for Bayesian evolutionary inference 
with function-valued traits. Some of our mathematical results are summarised in the following 
Proposition. 

Proposition II. 1. 1. If the marginal covariance function E is space-time separable so that it 
can be expressed as 

T,{{xi,ti), {X2,t2)) = K{xi,X2)k{ti,t2), (5) 

then the phylogenetic covariance function St is also space-time separable, i.e. 

T,T{{xi,ti), (X2,t2)) = ii'(xi,a;2)/CT(tl,t2) (6) 

where feT(ti,t2) is the phylogenetic covariance function constructed from the time- dependent 
component k in (5). 

2. When the time- dependent component k of (5) specifies a process that is Markovian in time, 
we have the simple expression 

/CT(tl, ts) = kiti,ti2)k{ti2,ti2y^k{t2,ti2) (7) 

where ti2 is the most recent common ancestor of ti and t2 (and ti2 is its depth in T). 

C. Example 

We close this section by constructing an example of a phylogenetic covariance function St. 
Suppose that we wish to perform evolutionary inference for a set of function-valued traits. The 

simplest possible structure for the marginal covariance function S is a space-time separable func- 
tion. If the traits are smooth, we should choose a spatial component which generates smooth 
random functions: a commonly used choice is the squared exponential covariance function: 

K{xuX2) = exp (-(xi - X2f/2el) . (8) 

We may believe that, conditional on any given trait, its ancestor and progenitor traits are statisti- 
cally independent. This corresponds to choosing a temporal component which is Markovian. The 
only Markovian Gaussian processes are the class of Ornstein-Uhlenbeck processes, and have the 
following covariance function: 

k{ti,t2) = exp(-|ti-t2|/^2). (9) 

From Proposition II. 1 we obtain 

ST((xi,ti),(x2,t2)) = exp(-(xi-X2)V2e?-C?T(tl,t2)/^2) (10) 

where d'r(ti,t2) = |ii — ^12! + |ii2 "^2! is the total amount of evolutionary time which elapses when 
moving through the chronogram T from ti to t2 via their most recent common ancestor ti2. 

As noted above, these simple examples may be combined by summation to construct other 
(non-separable) phylogenetic covariance functions. As an example, in applications where there is 
measurement noise, the following phylogenetic covariance function, S^, contains an uncorrelated 
noise term whose influence is controlled by the choice of the parameter (Jq: 

ST((a;i,ti), (a;2,t2)) = (1 - o-q) exp (-(xi - X2)^/26'i - dT(ti, t2)/6'2) + cro(5ti,t2<5a;i,x2 (H) 
where S is the Kronecker delta. 
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III. ILLUSTRATIVE APPLICATION 

To illustrate the above method with experimental data, we considered a version of the game 
'telephone' (where players are nodes in a directed cycle graph and spoken messages are passed 
between successive neighbours to comic effect) with two differences: the players memorise and 
reproduce a drawing between neighbours, and the players are arranged as nodes in a tree graph 
and pass the message from root to the tips. This data was created by students from The Swinton 
High School and has the virtue that we know the true phylogeny and have functional data at 
internal nodes of the tree (including the root). We model the evolution with Eq. 11 since the 
data has independent errors at each point of observation, is smooth in trait-space (see e.g. the 
functions at the tips in Fig. 1) and, by design, may be assumed to be Markovian and stationary in 
evolutionary time. In the supplement we discuss the choice of hyperparameters 9i, 62, (Tq. For the 
thirty distinct points of observation (ten measurements per function- valued trait) one uses Eq. 11 
with knowledge of the phylogeny T and hyperparameters to construct the thirty-by-thirty matrix 
of covariances S'rp(L, L). 

For illustration we choose the model selection task of inferring the correct tree topology, given 
knowledge of its branch lengths (two leaves are close, at an evolutionary separation of two time 
units from their common ancestor, and one leaf is more distant, at a separation of four units from 
the root). There are three possible trees with three tips and these branch lengths, and each choice 
results in a different phylogenetic covariance matrix E'rp(L,L) for use in model selection. One can 
use Eq. 1 to perform a simple phylogenetic inference and calculate Bayes factors for the three 
pairs of trees. For our choice of hyperparameters, wc find Bayes factors which suggest the actual 
tree is a decisive choice over the other two possible trees (but note that our aim here is only to 
illustrate our method, and not to make a scientific claim about this particular dataset). Given 
the correct phylogeny, one can also calculate the posterior mean and variance of an ancestral trait 
using Eqs. (2,3,4). This requires the construction of S!^(M, L) and S!^(Af, M). These summaries 
of the posterior are shown, for a set of points M at the root of the phylogeny, by the red and blue 
curves in Fig. IB. 

IV. DISCUSSION 

In this paper wc have exploited the powerful inference architecture provided by Gaussian pro- 
cesses to address phylogenetic questions for function-valued data. Explicit posterior distributions 
are available, giving a straightforward approach to the prediction of unobserved function-valued 
traits, as well as a principled approach to evolutionary model selection. The approach is suitable 
both for complete functional observations of fTinction-valued traits and for discretely and even 
irregularly sampled traits, with missing observations. 

V. APPENDIX A: EXPERIMENTS 

A group of student volunteers from The Swinton High School, Salford generated the illustrative 
data we supply. Their desks were arranged in the form of a tree - see Fig. 2. The student 
positioned at the root of the tree was shown a curve, asked to commit it to memory and then, 
immediately after it was removed, reproduce it from memory (by tracing their finger across a tablet 
device). That student's drawing was then shown to his/her (downstream) neighbour in the tree 
who similarly committed it to memory and then attempted to reproduce it. Students located at 
branch nodes of the tree showed their drawings to their two neighbours in turn. In this manner 
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FIG. 1: Sample data generated by experiments with Class llSl. Each piece of functional data was generated 
by a different student. The seating plan of the students was arranged to be in the form of a tree. We display 
the raw data in A) giving the curves drawn by students sitting at the root and the tips. B) Data processing 
and statistical inference. The black curves at the tips show the data extracted, the red curve at the root 
indicates the predicted mean surface given only the data from the leaves, the blue curves are one standard 
deviation uncertainties at each point. The three curves at the top right are example samples from the 
Gaussian process at the root. 

the information at the root of the tree propagated to the tips. 

Our aim is not to reprise existing work in hyperparameter selection (or indeed Gaussian Process 
Regression) but only to illustrate relevant aspects of the method we present - our approach to data 
analysis in this section is thus purely pedagogical. When used in earnest, maximum likelihood 
methods using Eq. 1 could be used to find hyperparameter choices. In the main text we consider 
inferring an ancestor four desks away from the tips (ignoring the curve with which the process 
was initialized); we suppose, by inspection of the data, that correlations in evolutionary time are 
appreciable. We thus choose a value for 62 in the Ornstein-Uhlenbek covariance (Eq. 9 of the 
main text) which is large: in this case thirty desks. Similarly, for the curve drawings, we observe 
that there are appreciable correlations on the lengthscale of around one-third of a curve. We thus 
set the hyperparameter 61 in the squared exponential covariance (Eq. 8 of the main text) to be 
approximately one-third of the image width. The digitisation process we used is crude and we 
suppose that this creates Gaussian distributed independent errors, whose variance represents o"q% 
of the total variance in the data. 

The curves were drawn using vertical guidelines (Fig. 2); these were removed for the purposes 
of analysis. Each line drawn by a student had an appreciable breadth in pixels: a mid-line was 
extracted. The curves were crudely aligned: they were resampled using the MATLAB 'resample' 
function to ensure that they all had the same number of evenly spaced points of observation, and 
normalized so that their minimum values were zero and maximum values were one. Our logic in 
this alignment was that when students copied from one image to the next they were not attempting 
to be accurate as regards horizontal or vertical scaling. 

This experiment was part of the EPSRC funded public engagement project "Mutating Messages 
- A Public Experiment" EP/I0176I5/1 http://mutatingmessages.blogspot.com/. We would like to 
thank especially both the students of Class IISI and their teacher Ellen Pope. We would also like 
to thank Liam Moriarty and Karen Bultitude for their assistance in gathering this data. 




Leaf1 Leaf2 LeafS 



FIG. 2: Sample data generated by experiments with volunteers from Class llSl. Each piece of fmictional 
data was generated by a different student. The seating plan of the students was arranged to be in the form 
of a tree; in this figure the data generated by each student is positioned to correspond to the seating plan. 
Essentially this was a game of 'telephone' but with participants passing on drawings, not sounds, and with 
the participants sitting in positions along a tree graph, not a directed cycle graph. 

VI. APPENDIX B: MATHEMATICAL EXPRESSIONS FOR THE PHYLOGENETIC 

COVARIANCE FUNCTION 

In this section we derive some mathematical expressions for the phylogenetic covariance function. 
Using the terminology of the main text, we first assume that the marginal covariance function is 
space-time separable. We do not, however, assume that the time-dependent component of the 
marginal covariance function is Markovian. We then give an application of these expressions to 
Bayesian inference for function-valued traits: given a complete functional observation, we obtain 
the posterior distribution of this phylogenetic Gaussian process. For necessary background on 
Gaussian process regression and reproducing kernel Hilbert spaces we refer to [7], sections 4.3 and 
6.1. 

1. Notation. 

For convenience we collect here the various definitions that will be used. Again using the 
terminology of the main text, let 

• S" be the internal index set for the function-valued trait 

• T be a phylogenetic tree whose branch lengths represent time (a chronogram), so that each 
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point 6 £ T has a time coordinate, which we will refer to as the date of 6. Let 6i, 02 be 
points in T, and let their most recent common ancestor in T be 9i2, with date ti2- Let ta, tj, 
be respectively the earliest and latest dates of points in T 

• k he a continuous covariance function on [ta^^b] with only finitely many nonzero eigenvalues 
(that is, a degenerate Mercer kernel). Let k^"^ be its restriction to [ta)ii2], and define 

kt^{t2) = k{ti,t2) and kl^{t2) = k^\h,t2) (12) 

Let be the unique reproducing kernel Hilbert space defined by k^'^, with inner product 
(•,-)i2. Let i'^'^ be a Borel measure on [ta)ii2]; and let eP,...,e^^ and AP,...,A^^ be 
respectively the eigenfunctions and (strictly positive) eigenvalues of k^^ with respect to the 
measure v^^, that is 

/ k'\s,w)4\w)dul\w) = APef(.) (13) 

Jta 

/ e\\w)ef{w)dvl\w) = S{i,j) (14) 

Jta 

where 6 is the Kronecker delta. Then by Mercer's Theorem, 

r 

k'\s,w) = ^\fe]\s)e]\w) (15) 

Let y be a Gaussian process on [ta, tf] with covariance function k and Y^'^ be a Gaussian 
process on [ta,ii2] defined by 

^''H = Ev^^^«PH (16) 

i=l 

where the Zi are independent standard Normal random variables. It is easy to check that the 
covariance function of Y^"^ is k^'^. The Gaussian processes Y and Y^'^ therefore have the same 
distribution when restricted to [ta)ii2]- Recalling (12), for ti2 < t < tb, let l^l^{t), . . . ,/iJ^(t) 
be the Fourier coefficients of kt in the basis e^, . . . , ej^: 

f4\t)= f'\t{s)e\\s)dv'\s) 

Jta 

so that 

kt{s) = j2n]\t)ef{s) (17) 
1=1 

• k'r{ti,t2) be the covariance function of the phylogenetic Gaussian process on the phylogeny 
T with marginal covariance function k{ti,t2) (this is a simple phylogenetic Gaussian process, 
which depends on time but has no index for the function-valued trait) 

• -fC be a continuous covariance function on 5" with only finitely many nonzero eigenvalues 
Af , . . . , A;^, and corresponding eigenfunctions ef , . . . , e;^ with respect to a Borel measure . 
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To make this presentation self-contained, we restate here from the main text the two assumptions 
we use to construct a Gaussian process of function- valued traits indexed by T: 

Assumption II. 1. Conditional on their common ancestors in the phylogenetic tree T, any two 
traits are statistically independent. 

Assumption II. 2. The statistical relationship between a trait and any of its descendants in T is 
independent of the topology o/T. 

Our main result is the following expression for the phylogenetic covariance function, which uses 
the Reproducing Kernel Hilbert Space inner product- 
Proposition VI. 1. For 9i, 92 & T we have 

kT{ei,e2) = {kt„kt,) 12 (18) 

i=l i 

Proof. Conditioning on common ancestry. By the definition of A;t and the Law of Iterated 
Expectations we have 

kT{9i,e2) = E[Y{ei)Y{62)] 

= E[E[Y{ei)Y{e2)\Y{e) -.e is an ancestor of 612]] (20) 

Applying conditional independence. Then by assumption 2.1, 

kT{0i,92) = E[E[Y{9i)\Y{9) : 9 is an ancestor 0/^12] x (21) 
E[Y{92)\Y{9) : 9 is an ancestor 0/^12]] (22) 

Conditioning the marginal process. We continue to develop (22) by obtaining an expression for 
the conditional expectation 

E[Y{9i)\Y{9) -.9 is an ancestor of 9 12] 
Lemma VI. 1. With the notation of section VI 1 we have 

E[Y{ei)\Y{9) : 9 is an ancestor 0/^12] = {kt„Y)i2 = 12 ' (^3) 

i=i VXi 

where the final inequality holds in distribution. 

Proof. Because of Assumption 2.2 and the tree structure of T, the statement (23) is a claim about 
the marginal covariance function k. Then for s € [ta, ti2] and ti € [ti2, th], since kl^ is a reproducing 
kernel we have 

{kt^,kl^)i2 = kt,{s) = k{ti,s) (24) 

It follows from Fubini's Theorem that 

E[{kt,,Y)i2Y{s)] = E[Yiti)Y{s)] (25) 

and hence that the conditional expectation of Y{ti) given {Y(w) : w G [^01^12]} is {kt-^,Y)i2, as 
required. Taking the inner product of (16) and (17) (see [7], equation (6.1)) establishes the lemma. 

□ 

To finish the proof, observe that by (21) and (23) we have 

kTi9i,92) = j2 ^'^'^''ll'p^'^ = {h„h,h2 (26) 

1=1 i 

as required. □ 
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A. Proof of Proposition II. 1, part 1 (main text) 

Let kT be phylogenetic covariance function defined above, which depends only on time, and 
let Xi, i = 1, . . . ,n he independent phylogenetic Gaussian processes on T each with covariance 
function fcx- Define a Gaussian process by 

/(x,e) = ^^AfX,(^)ef(x) (27) 

i=l 

In the terminology of the main text, it is easy to check that / is then a phylogenetic Gaussian 
process, indexed by both space and time, with separable covariance function 

n 

^t{{xi,0i),{x2,92)) = A;T(^i,^2)^Vf(a;i)ef(x2) (28) 

i=l 

= kT{ei,e2)K{xi,x2) (29) 

It remains only to check that St has the correct marginal covariance. This follows by choosing 9i 
to be an ancestor of O2 in T, since then by Assumption 2.2 we have A;t(^i,^2) = ^(^1,^2) and so 

ST((iCl,^l),(x2,e2)) = k{h,t2)K{xi,X2) (30) 

= S((xi,ti),(ar2,t2)) (31) 
as required. □ 



B. Non-separable phylogenetic covariance functions 

Since the sum of any two covariance functions is again a covariance function, the space-time 
separable phylogenetic covariance functions obtained in section VI A allow the construction of a 
palette of non-separable phylogenetic covariance functions. Non-separable phylogenetic covariance 
functions also arise in Bayesian inference, if independent Gaussian measurement errors are added 
at the points of observation. A relevant discussion entitled 'making new kernels from old', including 
case studies, is given in [7]. 



C. Functional observations 

Equations (2)- (4) in the main text give the posterior distribution for a phylogenetic Gaus- 
sian process after observations are made at a finite set of space-time points L C 5 x T. When 
whole function- valued traits are observed (rather than samples of function- valued traits at discrete 
points), we have observations at an infinite number of points in 5 x T and so this representation no 
longer applies. In this case, taking the representation (27) as a Gaussian process prior enables us 
to form a posterior distribution. Suppose that the function-valued trait fe is observed at the point 
9 in the phylogeny. Then under the prior distribution, lies in the span of the eigenfunctions 
ef , . . . , e;^ and so has the representation 

n 

1=1 

where the ai{9) may be calculated by taking Fourier coefficients: 

ai{9) = [ fe{w)ef{w)diy^{w) (33) 
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From the representation (27) it follows that the single functional observation /t is equivalent to 
the set of point observations 

Xiid) = ^, i = l,...,n. (34) 

Functional observations of entire function- valued traits at 6i, . . . ,6p therefore correspond to point 
observations of each of the independent phylogenetic Gaussian processes Xi,. . . , Xn at each of the 
phylogenetic points tj. The posterior distributions of the Gaussian processes X^, and hence of the 
original space-time Gaussian process /, then follow by Equations (2)- (4) in the main text. 



D. Markovian case 

We say that S (or, equivalently, St) is Markovian in evolutionary time if, given any G T, the 

function-valued traits at all ancestors of are statistically independent of those at all descendants 
of 9 given the value of the trait /g. If S is time-space separable and its time-dependent component 
k{ti,t2) is Markovian, formula (28) simplifies further to give the explicit expression 

^T{{xi,ti), (X2, ^2)) = K{xi,X2)k{ti,ti2)k{ti2, ti2y^k{t2, ^12) (35) 

A proof may be given similar to the above (with z/^^ the Dirac mass at ii2), but a simpler proof 
is the following. Let g he a. phylogenetic Gaussian process with marginal covariance function k. 
Then by the Law of Iterated Expectations we have 

fcT(ti,t2) = E[E[g{ti)g{t2)\9{ti2)]] (36) 

By the Markov property, the functional traits at times ti and t2 depend on their common ancestry 
only through the Normal random variable g{t\2)- Assumption II. 1 therefore gives that 

A:T(ti,t2) = E[E[g{U)\g{t^2)]E[g{t2)\g{ti2)]\ (37) 

By Assumption II. 2 and standard properties of Normal random variables we have 

E[g{U)\g{t^2)\ = k{ti,ti2)k{ti2,ti2)~'^g{ti2) (38) 

from which the required result follows, after substitution into (37) . □ 
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